Rapid Unsupervised Topic Adaptation – a Latent Semantic Approach
Authors
Abstract
In open-domain language exploitation applications, a wide variety of topics with swift topic shifts has to be captured. Consequently, it is crucial to rapidly adapt all language components of a spoken language system. This thesis addresses unsupervised topic adaptation in both monolingual and crosslingual settings. For automatic speech recognition, we rapidly adapt a language model on a source language. For statistical machine translation, we adapt a language model of a target language, a translation lexicon and a phrase table using a source text. For monolingual adaptation, we propose latent Dirichlet-Tree allocation for Bayesian latent semantic analysis. Our model enables rapid incremental language model adaptation by caching the fractional topic counts of word hypotheses decoded from previous speech utterances. Latent Dirichlet-Tree allocation models topic correlation in a tree-based hierarchy and thus addresses the model initialization issue. To address the "bag-of-words" assumption in latent semantic analysis, we extend our approach to N-gram latent Dirichlet-Tree allocation. We investigate a fractional Kneser-Ney smoothing approach to handle fractional counts for topic models. The algorithm produces a more compact model than Witten-Bell smoothing. Using multi-stage language model adaptation via N-gram latent Dirichlet-Tree allocation, we achieve significant reductions in speech recognition errors with our large-scale GALE systems on two different languages: Mandarin and Arabic. For end-to-end translation of speech inputs, applying topic adaptation to automatic speech recognition also benefits translation performance. For crosslingual adaptation, we propose bilingual latent semantic analysis for statistical machine translation. A key feature of bilingual latent semantic analysis is a one-to-one topic correspondence between models of a source and a target language.
Since topical information is language independent, our model enables transfer of a topic distribution inferred from a source text to a target language for crosslingual adaptation. Our approach has two advantages: first, it can be applied before translation and thus has an immediate impact on translation; second, it does not rely on a translation output for adaptation, and ...
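The crosslingual transfer described in the abstract can be illustrated in a few lines. The sketch below is a toy example, not the thesis implementation: all matrices and counts are invented for illustration. Two languages share K aligned topics (the one-to-one correspondence of bilingual latent semantic analysis); a topic mixture theta is folded in from a source document by EM, and the same theta is then reused to build an adapted unigram distribution over the target vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy bLSA setup (illustrative numbers only): source and target
# languages share K topics with a one-to-one correspondence, each
# language having its own topic-word distributions.
K, V_src, V_tgt = 3, 6, 5
phi_src = rng.dirichlet(np.ones(V_src), size=K)   # p(src word | topic k)
phi_tgt = rng.dirichlet(np.ones(V_tgt), size=K)   # p(tgt word | topic k)

def infer_theta(doc_counts, phi, iters=50):
    """Fold in a source document: EM for the topic mixture theta."""
    theta = np.full(phi.shape[0], 1.0 / phi.shape[0])
    for _ in range(iters):
        # E-step: responsibilities p(topic | word) for each vocab word
        resp = theta[:, None] * phi             # K x V
        resp /= resp.sum(axis=0, keepdims=True)
        # M-step: re-estimate theta from fractional topic counts
        theta = (resp * doc_counts).sum(axis=1)
        theta /= theta.sum()
    return theta

src_doc = np.array([4, 0, 2, 1, 0, 3], dtype=float)  # source word counts
theta = infer_theta(src_doc, phi_src)

# Crosslingual transfer: reuse the source-side theta to build an
# adapted unigram distribution over the *target* vocabulary.
p_tgt_adapted = theta @ phi_tgt
```

Because theta is inferred from the source text alone, this adaptation can run before any translation output exists, which is exactly the advantage the abstract claims.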
Similar resources
Unsupervised language model adaptation using latent semantic marginals
We integrated the Latent Dirichlet Allocation (LDA) approach, a latent semantic analysis model, into an unsupervised language model adaptation framework. We adapted a background language model by minimizing the Kullback-Leibler divergence between the adapted model and the background model, subject to the constraint that the marginalized unigram probability distribution of the adapted model is equal t...
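A common closed-form instance of this style of adaptation rescales the background model toward the topic-inferred unigram marginal. The snippet below is a sketch with invented toy distributions; the damping exponent beta and the power-scaling update are the standard "latent semantic marginals" formulation, not necessarily the exact solution derived in the cited paper.

```python
import numpy as np

# Hypothetical toy distributions (illustrative only): a background
# unigram LM and an LDA-inferred in-domain unigram marginal.
p_bg  = np.array([0.40, 0.30, 0.15, 0.10, 0.05])  # background p(w)
p_lda = np.array([0.10, 0.20, 0.30, 0.25, 0.15])  # topic-adapted marginal

def marginal_adapt(p_bg, p_lda, beta=0.5):
    """Scale the background model toward the LDA marginal.

    p_a(w) is proportional to p_bg(w) * (p_lda(w) / p_bg(w))**beta,
    with beta < 1 damping the adaptation strength.
    """
    scaled = p_bg * (p_lda / p_bg) ** beta
    return scaled / scaled.sum()

p_adapted = marginal_adapt(p_bg, p_lda)
# Words favored by the topic model gain probability mass:
# p_adapted[2] > p_bg[2]
```

In practice the same ratio can be pushed down to n-gram probabilities (rescale each p(w|h) by the unigram factor and renormalize), which is what makes the unigram-marginal constraint useful for full language model adaptation.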
Latent Topic Modeling for Audio Corpus Summarization
This work presents techniques for automatically summarizing the topical content of an audio corpus. Probabilistic latent semantic analysis (PLSA) is used to learn a set of latent topics in an unsupervised fashion. These latent topics are ranked by their relative importance in the corpus and a summary of each topic is generated from signature words that aptly describe the content of that topic. ...
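The ranking-and-signature-word step can be illustrated directly from PLSA posteriors. The example below uses invented toy values for p(z|d) and p(w|z): topics are ranked by their average posterior mass across documents, and each topic is summarized by its highest-probability words.

```python
import numpy as np

# Toy PLSA posteriors (illustrative): p(z|d) for 3 documents over
# 2 topics, and p(w|z) over a small vocabulary.
vocab = ["game", "team", "score", "stock", "market", "price"]
p_z_given_d = np.array([[0.9, 0.1],
                        [0.8, 0.2],
                        [0.2, 0.8]])                            # D x K
p_w_given_z = np.array([[0.35, 0.30, 0.25, 0.04, 0.03, 0.03],
                        [0.02, 0.03, 0.05, 0.35, 0.30, 0.25]])  # K x V

# Rank topics by relative importance: average posterior mass in corpus.
topic_weight = p_z_given_d.mean(axis=0)
ranking = np.argsort(-topic_weight)

# Summarize each topic by its top "signature" words.
for z in ranking:
    top = np.argsort(-p_w_given_z[z])[:3]
    print(f"topic {z} (weight {topic_weight[z]:.2f}):",
          [vocab[i] for i in top])
```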
Recurrent neural network language model adaptation for multi-genre broadcast speech recognition
Recurrent neural network language models (RNNLMs) have recently become increasingly popular for many applications including speech recognition. In previous research RNNLMs have normally been trained on well-matched in-domain data. The adaptation of RNNLMs remains an open research area to be explored. In this paper, genre and topic based RNNLM adaptation techniques are investigated for a multi-g...
Unsupervised Latent Speaker Language Modeling
In commercial speech applications, millions of speech utterances from the field are collected from millions of users, creating a challenge to best leverage the user data to enhance speech recognition performance. Motivated by an intuition that similar users may produce similar utterances, we propose a latent speaker model for unsupervised language modeling. Inspired by latent semantic analysis ...
Bilingual-LSA Based LM Adaptation for Spoken Language Translation
We propose a novel approach to crosslingual language model (LM) adaptation based on bilingual Latent Semantic Analysis (bLSA). A bLSA model is introduced which enables latent topic distributions to be efficiently transferred across languages by enforcing a one-to-one topic correspondence during training. Using the proposed bLSA framework crosslingual LM adaptation can be performed by, first, in...